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ABSTRACT 


In  the  course  of  designing  or  evaluating  signal  processing  algorithms,  we  often  must  determine  the 
computational  workload  needed  to  implement  the  algorithms  on  a  digital  computer.  The  floating-point 
operation  (flop)  counts  for  real  versions  of  the  most  common  signal  processing  kernels  are  well  docu¬ 
mented.  However,  the  flop  counts  for  kernels  operating  on  complex  inputs  are  not  as  readily  found.  This 
report  collects  the  flop  count  expressions  for  both  real  and  complex  kernels  and  also  presents  brief  outlines 
of  the  derivations  for  the  flop  count  expressions.  These  flop  count  expressions  are  summarized  below. 


Signal  Processing  Kernel 

Computational  Complexity 

Real  Input 

Complex  Input 

matrix-matrix  multiplication 

2mnp 

Smnp 

fast  Fourier  transform 

InlogjH 

5nlog2n 

Householder  QR  decomposition 

1 

oo 

forward  or  back  substitution 

2 

n 

4n^ 

eigenvalue  decomposition:  eigenvalues  only 

4  3 

r 

16  3 

T” 

eigenvalue  decomposition:  eigenvalues  and 
eigenvectors 

9n 

23n 

singular  value  decomposition:  singular  values 
only 

A  2  4  3 

4mn  --n 

singular  value  decomposition:  singular  values 
and  left  singular  vectors 

2  2 

4m  n  12mn 

16m^n  +  24mn^ 

singular  value  decomposition:  singular  values 
and  right  singular  vectors 

Amn^+Mn 

I6mn^  +  24n^ 

singular  value  decomposition:  singular  values, 
left  and  right  singular  vectors 

Am'  n  +  llmr^  +  I3n^ 

I6m^  n  +  lAmn"  +  29r^ 

iii 


In  the  table  above,  the  parameters  in  the  computational  complexity  expressions  are: 


•  the  dimensions  of  the  two  multiplicands  -  m  x  n  and  nx  p  -  for  the  matrix-matrix  multipli¬ 
cation 

•  the  length  of  the  vector  n  for  the  fast  Fourier  transform 

•  the  size  of  the  triangular  system  n  for  forward  and  back  substitutions 

•  the  dimensions  of  the  input  matrix  mxn  for  the  Householder  QR  decomposition,  eigen¬ 
value  decomposition,  and  singular  value  decomposition. 
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1.  INTRODUCTION 


In  the  course  of  designing  or  evaluating  signal  processing  algorithms,  we  must  often  determine  the 
computational  workload  needed  to  implement  the  algorithms  on  a  digital  computer.  The  floating-point 
operation  (flop)  counts  for  real  versions  of  the  most  common  signal  processing  kernels  are  well  docu¬ 
mented.  However,  the  flop  counts  for  kernels  operating  on  complex  inputs  are  not  as  readily  found.  This 
report  collects  the  flop  count  expressions  for  both  real  and  complex  kernels  and  also  presents  brief  outlines 
of  the  derivations  for  the  flop  count  expressions.  Specifically,  the  following  computational  kernels  will  be 
treated: 

•  matrix-matrix  multiplication 

•  fast  Fourier  transform  (FFT) 

•  Householder  QR  factorization 

•  forward  and  back  substitutions 

•  eigenvalue  decomposition 

•  singular  value  decomposition 
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2.  MATRIX  MULTIPLICATION 


2.1  REAL  MATRIX  MULTIPLICATION 

The  elements  of  the  product  matrix  C€  of  real  matrices  A  e  and  B€  are 

given  by  ([4]): 


p 

C  =  AB=>Cij= 

*=  I 


(1) 


Therefore,  for  each  of  the  mn  elements  in  C ,  we  perform  p  multiplications  and  p  additions,  giving 
us  a  flop  count  of; 

flop  count  =  2mnp  (2) 


2.2  COMPLEX  MATRIX  MULTIPLICATION 


The  elements  of  the  product  matrix  C  €  Cf"^" 
given  by  ([4]): 


of  complex  matrices  Ae  and  Be 


are 


p 

C  =  AB=>  Cij  =  5^  (3) 

k=  I 

For  each  of  the  ntn  elements  in  C ,  we  perform  p  multiplications  and  p  additions.  However,  these 
multiplications  and  additions  are  between  complex  numbers. 

The  complex  sum  of  two  complex  scalars  requires  two  flop:  one  flop  to  add  the  real  components,  and 
another  flop  to  add  the  imaginary  components. 

The  complex  product  z  of  two  complex  scalars  x  =  a  +  jb  and  y  =  c  +  jd,  where  a,  b,  c,  and  d 
are  all  real  scalars  and  j  =  in  this  context,  is: 

z  =  Jcy  (4) 

=  (a  +  jb)  X  (c  +  Jd) 

=  (ac - bd)  +  Jx  (ad  +  be) 

and  requires  six  flop: 

•  four  multiplications;  ac ,  bd ,  ad ,  and  be 

•  two  additions;  (ac  -  bd)  and  (ad  +  be) 
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Therefore,  for  each  of  the  mn  elements  in  C,  we  perform  8p  flop,  giving  us  a  flop  count  of: 

flop  count  =  8mnp  (5) 

2.3  ALTERNATIVE  COMPLEX  MATRIX  MULTIPLICATION 

We  can  compute  the  product  matrix  Ce  of  complex  matrices  Ae  and  fie  in 

an  alternative  fashion  that  reduces  the  workload  by  approximately  25%  ([10]). 

First,  we  separate  the  multiplicands  A  and  fi  into  their  real  and  imaginary  components.  Let 

A^  e  be  the  real  part  of  A,  Aj€  be  the  imaginary  part  of  A ,  fi^  e  !R^^"  be  the  real  part  of 

fi ,  and  fi,-  6  be  the  imaginary  part  of  fi : 


A  =  +  jAf 

(6) 

fi  =  fi,  +  yfi. 

(7) 

Then,  the  product  matrix  C  is: 

C  =  AB 

(8) 

=  (A^  +  yA,.)x(fi^  +  >fi,.) 

=  (A,fi,-A,fi,)  +  y(A^,  +  A,fi,) 


Now  consider  the  real  matrices  G  e  RT  ^  ^  and  //  e  R’’^"  ,  which  we  deflne  thusly: 


G  =  A,+  A,.  (9) 

H  =  fi,-fi,  (10) 

Then 

GH  =  A^,-A,fi,  +  A,fi,-A,fi,.  (11) 

GH  +  A^Bi-AiB^  =  A^B^-AfBi  (12) 

GH  +  A,fi,.  -  A,fi,  +  j(A,Bi  +  A,.fi,)  =  A^B,  -  Afi-,  +  j{A,B;  +  A,.fi,)  ( 1 3) 

=  C 


Computing  G  =  A^  +  A,-  requires  mp  flop.  Computing  H  =  B^- fi,-  requires  pn  flop.  Computing 
the  product  GH  requires  2mnp  flop.  Computing  the  product  A,.fi,-  also  requires  2mnp .  Computing  the 
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product  AjB^  requires  another  2mnp  flop.  Computing  GH  +  A^B^- for  the  real  portion  of  C 
requires  2mn  flop.  Computing  A^Bj  +  AjB^  for  the  imaginary  portion  of  C  requires  mn  flop.  The  total 
flop  count  for  computing  C  =  AB  using  this  alternative  method  is 

flop  count  =  6mnp  +  3mn  +  (m  +  n)p  (14) 

which  is  approximately  three-quarters  of  the  Smnp  flop  needed  to  compute  the  product  using  the  straight¬ 
forward  method. 


3.  FAST  FOURIER  TRANSFORM 


3.1  COMPLEX  FFT 


The  discrete  Fourier  transform  (DFT)  X  of  a  finite-length  sequence  jc  of  length  N  is  given  by  ([7]); 

N-i 

X[k]=  '^x[n]W^‘'\k  =  0,1,.. .,N -I  (15) 

n  =  0 

where 


(16) 


If  we  were  to  compute  the  DFT  by  explicitly  evaluating  the  sums,  the  flop  count  would  be  0{N^) . 
We  can  dramatically  reduce  the  workload  by  decomposing  the  original  DFT  into  successively  smaller  DFT 
computations  in  decimation-in-time  algorithms  ([7]).  Let  us  consider  the  case  where  N  is  a  power  of  two: 

N  =  2'* ,  where  v  is  an  integer. 

Since  TV  is  an  even  integer,  we  can  separate  jc[nJ  into  two  sequences  of  length  N/2:  the  first 
sequence  consists  of  the  even-numbered  points  in  x[n],  while  the  second  sequence  consists  of  the  odd- 
numbered  points  in  jc[n] .  We  can  now  rewrite  Equation  15  as: 

x[k]=  X 

n  even  n  odd 

Let  us  substitute  n  =  2r  for  even  n  and  n  =  2r  +  1  for  odd  n : 

(N/2)- I  (N/2)- I 

xik]=  2;  x[2r]W/,^"*+  2;  (18) 

r-0  r=0 

(A^/2)-l  (yV/2)-I 

=  x[2r](W;,Y*+Vr/  x(2r+l](W^Y* 

r=0  r=0 

We  also  note  that  =  Wyy/2  • 

^^2  ^  ^-2,(2«/N)  ^  ^-y2./(N/2)  ^ 

We  can  therefore  rewrite  Equation  18: 
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{N/2)^\  {N/2)-\ 

XW=  ^  x[2r]W;,/2^*+W'/  5;  x[2r+l]H'^//*  (20) 

r=0  r=0 

=  Gm  +  w/f/w 

Both  G[k]  and  H[k]  are  (N/2)  -point  DFTs,  where  G(A:]  is  the  DFT  of  the  even-numbered  points 
of  x[n]  and  H[k]  is  the  DFT  of  the  odd-numbered  points  of  j:[n] .  If  the  length  of  the  original  sequence 
N  is  a  power  of  2,  we  can  use  this  technique  to  continue  to  further  decompose  the  DFTs  until  we  have  only 
DFTs  of  length  2  (see  Figure  1  for  the  case  of  N  =  8 ).  We  would  have  a  total  of  v  =  log2N  stages. 


Figure  1 :  Decimation-in-time  decomposition  of  an  8-point  DFT  into  2-point  DFTs 


If  we  expand  Figure  I  with  the  final  expression  in  Equation  20,  we  arrive  at  the  flow  graph  shown 
below  in  Figure  2  (for  the  case  N  =  8): 
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Figure  2;  Flow  graph  of  the  complete  decimation-in-time  decomposition  of  an  8-point  DFT  ((?]) 


In  Figure  2,  we  used  a  notation  where  branches  (arrows)  entering  a  node  (circle)  denotes  the  addition 
of  the  quantities  from  which  the  branches  originated,  and  a  coefficient  next  to  the  head  of  a  branch  denotes 
a  scaling  of  the  quantity  by  the  coefficient.  If  no  coefficient  is  indicated,  the  scaling  factor  is  assumed  to  be 
unity  by  default.  See  Figure  3  for  an  illustration  of  this  notation. 
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At  each  non-input  node,  we  perform  a  complex  multiplication  and  a  complex  addition,  which 
together  require  8  flop.  There  are  log2^  stages  of  nodes,  giving  us  a  total  flop  count  of  SyVlog,^ .  We  can 
reduce  this  flop  count  further  by  noting  that  ([7]) 

=  e-^<2K/Ar)yv/2  ^  ^  ^21) 

and  that  therefore 


The  butterfly  computation  shown  in  Figure  4  can  be  simplified,  as  shown  in  Figure  5. 


(22) 
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Figure  5:  Simplified  butterfly  computation  requiring  only  one  complex  multiplication  ([7]) 

Using  this  identity,  we  will  need  to  multiply  by  W/^  half  as  many  times,  resulting  in  the  flow  graph 


shown  below  in  Figure  6  (for  the  case  N  =  S). 
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For  each  pair  of  nodes,  we  perform  one  multiply,  one  add,  and  one  subtract,  for  a  total  of  10  flop  per 
pair  of  nodes  or  an  average  of  5  flop  per  node: 

flop  count  =  5/Vlog2A^  (23) 

Because  Wj^  depends  only  on  N,  it  can  be  pre-computed,  and  its  evaluation  does  not  contribute  to 
the  workload. 

This  technique  is  not  limited  to  radix  2  FFTs;  it  may  be  applied  to  other  radices. 
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3.2  REAL  FFT 


First,  we  note  that  the  DFT  of  a  real  vector  is  conjugate  even:  if  x  e  ,  the  DFT  X  e  of  x  has 
the  properties  that 


Re{X[k]}  =  Re{X[N-k]} 

(24) 

Im{X[k]}  =  -Im{X[N-k]} 

(25) 

If  we  take  advantage  of  the  resulting  additional  structure,  we  can  reduce  the  workload  necessary  to  evalu¬ 
ate  the  FFT  of  a  real  vector. 

The  computation  of  the  FFT  of  a  real  vector  uses  two  algorithms.  The  first  algorithm  computes  the 
DFT  of  two  real  vectors  of  equal  length  through  a  single  FFT  of  a  complex  vector  of  the  same  length  ([8]). 
This  algorithm  will  be  used  to  perform  FFTs  on  the  two  halves  of  the  input  vector. 

Let  /  and  g  be  two  real  vectors  of  length  N,  where  /  consists  of  samples  /[O]  through  f[N  -  1  ] 
and  g  consists  of  samples  g[0]  through  g[N  -  1  ] .  Let  F  be  the  DFT  of  /  and  G  be  the  DFT  of  g . 

The  fact  that  /  is  real  means  that  Re{F}  (the  real  part  of  F)  is  an  even  function  and  lm{F}  (the 
imaginary  part  of  F)  is  an  odd  function.  An  even  function  Re{F}  is  defined  as  follows: 

Re{F[k]}  =  /?e{F[Ar-^]}  forit^iO  (26) 

An  odd  function  Im{F}  is  defined  as  follows: 

/m{F[0]}  =  0  (27) 

Im{F[k]}  =  -Im{F{N-k]}  forked  (28) 

If  g  is  real,  then  jg  is  imaginary,  Re{jG}  is  an  odd  function,  and  Im{jG}  is  an  even  function: 


Fc{yG(0]}  =  0 

(29) 

Re{jG[k]}  =  -Re{jG[N-k]}  for  it  ^0 

(30) 

Im{jG[k]}  =  Im{jGlN-k]}  for  kit  0 

(31) 

Let  us  define 

h  =  f  +  jg 

(32) 

Then  H ,  the  DFT  of  /i ,  is 

H  =  F  +  jG 

(33) 

Let  us  first  consider  the  real  part  of  H : 
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Re{H}  =  Re{F}  +  Re{jG}  (34) 

where  Re{F}  is  even  and  Re{jG}  is  odd.  Given  these  conditions,  we  can  determine  that 

/?eOG[0]}  =  0-»/?e{F[0]}  =  /?e{//[0]}  (35) 

Re{Fll:N-l])  =  11} +  I:- 1:11) 

ReUClUN-m  = 

where  //[yV  -  1 :  -  1 ;  1  ]  are  elements  1  through  N-  \  of  //  in  reverse  order. 

Next,  let  us  consider  the  imaginary  part  of  H : 

Im{H}  =  lm{F}  +  Im{jG}  (38) 

where  lm{F}  is  odd  and  Im{jG}  is  even.  Given  these  conditions,  we  can  determine  that 

/m{F[0]}  =  0^/m{yG[0]}  =  Im{H[0]}  (39) 

faOC[l.A>-ll)  =  '"(Hll:W-U}H-fa{tflW-l:-l:lU  ,4,) 

Of  course,  once  we  have  jG ,  we  can  compute  G ; 

G  =  -jG  X  j  (42) 

or 

G  =  Im{JG}-JxRe{JG}  (43) 

A  MATLAB  implementation  of  this  algorithm  is  shown  below  in  Figure  7. 
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%  f  and  g  are  the  real  input  vectors 

h  =  f  +  i*g; 

H  =  fft(h); 

F(l)  =  real(H(l)); 

F(2:N)  =  0.5  *  (real (H(2 :N) )  +  real (H(N:-1:2) ) ) ; 
jG(2:N)  =  0.5  *  (real (H(2 :N) )  -  real (H(N:-1:2) ) ) ; 

jG(l)  =  i  *  iinag(H(l)); 

F(2:N)  =  F(2:N)  +  0.5i  *  (iinag(H(2  :N)  )  -  imag  (H  (N: -1 : 2 )  )  )  ; 
jG(2:N)  =  jG(2:N)  +  0.5i  *  (iinag(H(2  :N)  )  +  imag  (H  (N: -1 :  2  )  )  )  ; 

G  =  -jG  *  i; 

Figure  7:  MATLAB  code  for  computing  the  DFT  of  two  real  vectors  using  one  complex  FFT 

The  workload  for  this  algorithm  is  dominated  by  the  FFT  of  the  complex  vector  h ,  and  is  therefore 
approximately  5yVlog2A^  flop. 

The  second  algorithm  needed  to  compute  a  real  FFT  applies  a  conjugate-even  butterfly  to  an  input 
vector  ([11]).  This  algorithm  is  used  to  combine  the  two  conjugate-even  half-length  vectors  produced  by 

the  first  algorithm  to  produce  a  single  conjugate-even  full-length  vector.  Specifically,  if  re  where 

L  =  2^  and  ^  >  1 ,  the  algorithm  in  Figure  8  overwrites  x  with  where  is  the  conjugate- 

even  butterfly  matrix  for  an  L -point  FFT. 
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if  q  -=  1 

L_star  =  L  /  2; 
p  =  L  /  4; 

tau  =  X ( 1 ) ; 

x(l)  =  tau  +  x(L_star  +  1) ; 
x(L_star  +  1)  =  tau  -  x(L_star  +  1) ; 
x(3  *  p  +  1)  =  -x(3  *  p  +  1); 

for  count  =  2:p 

c  =  cos (2  *  pi  *  (count  -  1)  /  L); 
s  =  sin(-2  *  pi  *  (count  -  1)  /  L) ; 

u(count)  =  c  *  x(L_star  +  count)  -  s  *  x(L_star  +  p  +  count); 
v(count)  =  s  *  x(L_star  +  count)  +  c  *  x(L_star  +  p  +  count); 
end  %  for  count 

y (2 :p)  =  x(2 :p) ; 

x(2:p)  =  x(p  +  2:L_star); 

for  count  =  2:p 

x( count)  =  y( count)  +  u  (count); 
x(p  +  count)  =  y(p  -  count)  -  u(p  -  count); 
x(L_star  +  count)  =  z(count)  +  v ( count ) ; 
x(L_star  +  p  +  count)  =  “2(p  -  count)  +  v(p  -  count) 
end  %  for  count 

else 

tau  =  X ( 1 ) ; 

X ( 1 )  =  tau  +  X ( 2 ) ; 

X  ( 2 )  =  tau  -  X ( 2 )  ; 
end  %  if  q  ~=  1 


Figure  8:  MATLAB  code  to  apply  a  conjugate-even  butterfly  ([11]) 

This  algorithm  requires  approximately  5L/2  flop. 

To  compute  the  FFT  X  of  a  single  real  vector  jc  €  ([1 1]),  the  input  vector  is  first  divided  into  two 

half-length  vectors:  let  =  x[\  \2:n]  and  =  x[2:2:n] .  Next,  we  use  the  algorithm  in  Figure  7  to 

compute  »  which  are  the  DFTs  of  and  x^y^„ ,  respectively.  Finally,  we  use  the 
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algorithm  in  Figure  8  to  compute  X  = 


(«) 


''odd 


(ce) 


(ce) 


Computing  the  DFTs  of  the  half-length  vectors  requires 


flop  count  =  5(- 


(44) 


=  ^n[(log2n)-l] 


Applying  the  conjugate-even  butterfly  matrix  requires  an  additional  -n  flop,  bringing  the  total  flop  count 
for  an  FFT  of  a  real  vector  to 


flop  count  =  -nlogjn 


(45) 
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4.  HOUSEHOLDER  QR  DECOMPOSITION 


4.1  REAL  HOUSEHOLDER  QR  DECOMPOSITION 


The  QR  decomposition  factors  a  real  input  matrix  A  e  into  an  orthogonal  matrix  Qe 

and  an  upper  triangular  matrix  Re  r'"^"  ([4]); 

A  =  QR  (46) 

Q^Q  =  I,  QQ"  =  /  (47) 

The  Householder  QR  factorization  algorithm  entails  the  application  of  a  series  of  orthogonal  trans¬ 
formations  Pi  to  the  input  matrix  A  so  that  the  portion  of  the  matrix  below  the  diagonal  will  become  zero; 


P„...PiA  =  R  (48) 

where  each  matrix  P,-  has  the  form 

P  =  /-Pvv^  (49) 

To  avoid  explicitly  computing  the  Householder  reflection  /  -  Pvv  ,  we  use  the  following  implemen¬ 
tation  ([4]): 

deflne  w  =  pA^v  (50) 

(/-Pvv^)A  =  A-Pvv^A  (51) 

=  A  -  v(Pv^A) 

/I  T 
=  A  -  vw 


A  MATLAB  implementation  of  the  real  Householder  QR  decomposition,  which  overwrites  the  input 
matrix  with  the  upper  triangular  factor  (the  Householder  QR  decomposition  does  not  return  the  orthogonal 
factor),  is  shown  below  in  Figure  9. 


19 


[num_rows,  nuin_cols]  =  size  (A); 
for  col  =  l:num_cols 

%  Compute  the  Householder  vector  v 

[v,  beta]  =  house(A(col:num_rows,  col)); 

%  Apply  the  Householder  vector  to  the  remainder  of  the  matrix 
w  =  beta  *  V’  *  A(col :num_rows ,  col:num_cols); 

A(col  :num_rows,  col  inunucols )  =  A(col  :nuiTurows,  col  :num_cols )  -  v  *  w; 

%  Zero  out  the  remainder  of  the  column. 

A ((col  +  1) :num_rows,  col)  =  0; 
end  %  for  col 

Figure  9:  MATLAB  code  for  a  Householder  QR  decomposition  ([4]) 

The  function  house  (x)  (see  Figure  10)  computes  a  vector  v  and  a  scalar  p  such  that 
P  =  /“Pvv^  is  orthogonal,  where  P  =  -p-,and  Px  =  ||jc||2^i  {e^  =  [1,0,  ...,0]^). 

V  V 
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if  (isreal(x)) 
n  =  length (x); 
sigma  =  x(2:n)'  *  x(2:n); 

V  =  [1;  x(2 :n) ] ; 
if  sigma  ==  0 

beta  =  0; 
else 

mu  =  sqrt(x(l)'"2  +  sigma); 
if  x(l)  <=  0 

v(l)  =  x(l)  -  mu; 
else 

v(l)  =  -sigma  /  (x(l)  +  mu) ; 
end  %  if 

beta  =  2  *  v(l)  ^  2  /  (sigma  +  (v(l)  ^  2)); 

V  =  V  /  V  ( 1 )  ; 

end  %  if 
else 

V  =  x; 

nx  =  norm(x) ; 
v(l)  =  x(l)  +  nx; 
beta  =  1  /  (nx  *  (nx  +  x(l))); 
end  %  if 


Figure  10:  MATLAB  code  for  computing  the  Householder  vector  ([4]) 

In  computing  the  flop  count  for  a  real  Householder  QR  decomposition,  we  consider  the  application 
of  the  Householder  reflections  only:  the  flop  count  for  the  computation  of  the  Householder  vector  is 
ignored,  as  it  is  much  smaller  than  the  flop  count  for  the  application  of  the  Householder  reflections. 

For  the  i  th  iteration,  computing  i4  v  requires  2im  -  i)(n  -  i)  flop.  Computing  w  =  v  requires 

T 

m-i  flop,  but  is  ignored  as  it  is  a  secondary  term.  Computing  vw  requires  (m  -  i)(n  -  i)  flop.  Comput- 
T 

ing  A-vw  requires  another  (m-i)(n-i)  flop.  The  total  flop  count  for  the  ith  column  is 
4(/n  —  j)(n  -  i)  flop. 

Because  we  apply  the  Householder  reflection  to  the  portion  of  the  input  matrix  that  is  on  or  to  the 
right  of  the  diagonal  (see  Figure  1 1),  the  length  of  the  columns  decrease  by  one  as  we  operate  on  succes¬ 
sive  columns. 
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a  a  a  a  a 
0  a  a  a  a 
OObbb 
OObbb 
00  bbb 
OObbb 


Elements  a  have  already  been 
upper  triangularized 

The  Householder  reflection  is 
applied  to  elements  b 


Figure  1 1 :  Portion  of  input  matrix  to  which  apply  the  Householder  reflection 


The  total  flop  count  for  the  entire  mxn  input  matrix  can  be  computed  as  follows: 

n 

flop  count  =  4(m-i){n-i) 


I  =  1 


=  4  ^  [mn  -  (m  +  n)i  +  i^] 
1  =  1 


=  4  mn  -  4  y  (m  n)i  -f  4  P 


i = 1  1=1 


1  =  I 


n  n 


.2 


=  Amn  -  4(m  +  «)  ^  i  +  4  ^  i 

i  =  I  i  =  I 


=  Amn^  -  4(m  ^  ^  > 


=  Amn^  -  2(m  +  n)(n^  +  n)  +  |(2/i^  +  3n^  +  n) 


=  Amn^  -  2mn^  -  2n^  —  2mn  -  2n^  +  +  2n^  + 

3  3 


=  2mn^  —  —  2mn  +  ^/i 

3  3 


We  then  drop  the  lower  order  terms  to  arrive  at  the  canonical  estimated  flop  count  for  real  matrices: 


(52) 
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(53) 


flop  count  =  2mt?  -  |n^ 


=  Iri 


2 


4.2  COMPLEX  HOUSEHOLDER  QR  DECOMPOSITION 


If  the  input  matrix  A  e  c”’'"  is  complex,  the  QR  decomposition  factors  A  into  a  unitary  matrix 
Qe  and  an  upper  triangular  matrix  Re  ([4]): 


A  =  QR 


(54) 


=  I,  qq'"  =  I 


(55) 


A  MATLAB  implementation  of  the  complex  Householder  QR  decomposition,  which  overwrites  the 
input  matrix  with  the  upper  triangular  factor  (the  Householder  QR  decomposition  does  not  return  the  uni¬ 
tary  factor),  is  the  same  as  the  implementation  for  a  real  Householder  QR  decomposition,  which  was  given 
in  Figure  9. 

In  computing  the  flop  count  for  a  complex  Householder  QR  decomposition,  we  consider  the  applica¬ 
tion  of  the  Householder  reflections  only;  the  flop  count  for  the  computation  of  the  Householder  vector  is 
ignored. 

To  avoid  explicitly  computing  the  Householder  reflection  I  -  P vv  ,  we  use  the  following  implemen¬ 


tation: 


define  vv  =  PA^v 

(/  —  pvv^)A  =  A  —  Pw^A 


(57) 


(56) 


=  A  -  v(pv^A) 


=  A  —  vw' 


H 


H  H 

For  the  ith  iteration,  computing  A  v  requires  8(m-i)(n-/)  flop.  Computing  w  =  PA  v 

requires  2(m-/)  flop,  but  is  ignored.  Computing  vw  requires  6(m-/)(n-/)  flop.  Computing 

A  -  vw  requires  2(m  -  i)(n  -  i)  flop.  The  total  flop  count  for  the  i  th  column  is  16(m  -  i)(n  -  i)  flop,  or 
four  times  the  flop  count  for  the  real  Householder  QR  decomposition.  The  canonical  total  estimated  flop 
count  for  the  complex  Householder  QR  decomposition  is  therefore 
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flop  count  =  8n 


2 


(58) 
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5.  FORWARD  AND  BACK  SUBSTITUTIONS 


5.1  REAL  FORWARD  AND  BACK  SUBSTITUTIONS 


A  forward  substitution  allows  us  to  solve  the  lower  triangular  system  Lx  =  b  for  xe  'R"  given 

lower  triangular  L  6  (R"  "  and  &  e  (R”  .  Fundamentally,  the  forward  substitution  process  is  as  follows 
([4]).  First,  we  solve  for  the  first  unknown  jt, : 


Li|jc,  -  (59) 

jc,  =  T—  (60) 

^11 

Next,  we  use  this  value  of  x,  to  solve  for  X2 : 

LjiX,  +L22X2  =  &2 
^22-*2  ~ 

2 - 7 - 

^22 

Continuing  forward,  we  can  solve  for  all  elements  of  the  vector  x . 

A  MATLAB  implementation  of  the  forward  substitution,  which  overwrites  the  input  vector  b  with 
the  solution  vector  x,  is  shown  below  in  Figure  12. 


(61) 

(62) 

(63) 


b(l)  =  b(l)  /  L(l,  1) ; 
for  row  =  2:nuin_rows 

b(row)  =  (b(row)  -  L(row,  l:row  -  1)  *  b(l;row  -  1) )  /  L(row,  row); 
end  %  for  row 


Figure  12:  MATLAB  code  for  a  forward  substitution  ([4]) 


In  computing  the  flop  count  for  a  forward  substitution,  we  ignore  the  workload  necessary  to  compute 
b. 

- — ,  as  it  will  be  a  secondary  term. 

^11 


For  row  /,  computing  the  dot  product  |  x  ,  requires  2(/  -  1 )  flop.  Subtracting  this  dot 
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product  from  bj  requires  one  flop,  but  is  ignored  as  it  is  a  secondary  term.  Multiplying  the  difference  by 
1  /L:  j  also  requires  one  flop  and  is  also  ignored. 

The  flop  count  for  all  n  elements  of  x  can  be  computed  as  follows: 

n 

flop  count  =  ^  2(i  -  1 )  (64) 

/  =  I 


=  -2n  +  2  y  I 


=  -2n  +  2 


/=  I 

n(n+  1) 


2 

=  -  +  n  -h  n 

2 

=  n  —n 

We  then  drop  the  lower  order  term  to  arrive  at  the  canonical  estimated  flop  count  for  a  forward  sub¬ 
stitution  on  real  matrices: 

2 

flop  count  =  n  (65) 

A  back  substitution  is  the  analog  of  the  forward  substitution  for  upper  triangular  matrices,  letting  us 
solve  the  tri£uigular  system  Ux  =  b  for  x  e  Ol"  given  upper  triangular  U  e  ‘iC  ^  "  and  be  ^R"  .  Instead  of 
starting  with  the  first  unknown  and  working  forward,  we  start  with  the  last  unknown  x„  and  work  back. 

A  MATLAB  implementation  of  the  back  substitution,  which  overwrites  the  input  vector  b  with  the 
solution  vector  Jt,  is  shown  below  in  Figure  13. 


b(num_rows)  =  b(num_rows)  /  U(num_rows,  nuin_rows); 
for  row  =  (nuin_rows  - 

b(row)  =  (b(row)  -  U(row,  row  +  l:nuin_rows)  *  b(row  +  1 :  nuin_rows ) )  ... 

/  U(row,  row) ; 
end  %  for  row 

Figure  13:  MATLAB  code  for  a  back  substitution  ([4]) 

The  derivation  of  the  flop  count  for  a  back  substitution  is  almost  identical  to  the  derivation  for  the 
forward  substitution  flop  count,  giving  us  the  canonical  estimated  flop  count  for  a  back  substitution: 
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(66) 


flop  count  =  h 


2 


5.2  COMPLEX  FORWARD  AND  BACK  SUBSTITUTIONS 

The  algorithms  used  for  real  forward  and  back  substitutions  may  be  used  for  complex  forward  and 
back  substitutions. 

In  a  forward  substitution,  for  row  i ,  computing  the  dot  product  L,-  ,  x  x,  ,  requires  8(/  -  1 ) 

flop.  Subtracting  this  dot  product  from  requires  two  flop,  but  is  ignored  as  it  is  a  secondary  term.  Multi¬ 
plying  the  difference  by  1  /L,-  ,•  also  requires  two  flop  and  is  also  ignored.  The  flop  count  for  row  i  is  four 

times  the  flop  count  for  the  corresponding  computation  for  a  real  forward  substitution.  The  canonical  total 
estimated  flop  count  for  a  complex  forward  substitution  is  therefore 


2 


flop  count  =  4n 


(67) 


The  derivation  of  the  flop  count  for  a  complex  back  substitution  is  almost  identical  to  the  derivation 
for  the  complex  forward  substitution  flop  count,  giving  us  the  canonical  estimated  flop  count  for  a  back 
substitution: 


flop  count  =  4n 


2 


(68) 
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6.  EIGENVALUE  DECOMPOSITION 


6.1  REAL  EIGENVALUE  DECOMPOSITION 

The  symmetric  Schur  decomposition  of  a  real  symmetric  matrix  A  g  computes  an  orthogonal 

matrix  Qg  such  that  ([4]): 

Q^AQ  =  A  =  (69) 

This  decomposition  of  A  results  in  the  eigenvectors  being  the  columns  of  Q  and  the  corresponding 
eigenvalues  being  the  diagonal  elements  of  A  ([2]): 

From  the  definition  of  eigenvectors  and  eigenvalues: 

7*  T 

Aq  =  Xq,  where  q-  q^  =  1  and  q^  qj  =  0  for  /  ^  ;  (70) 

AQ  =  AC,  where  Q  =  QQ^  =  I  (71) 

Q^AQ  =  Q^AQ  (72) 

Let  5  =  AG  (73) 

n 

^ij  =  X  (74) 

k=  I 

=  X^qjj  because  X^^  =  0  for  i^k 

Let  C  =  Q^AQ  =  Q^B  (75) 
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^ij  =  X 

k=  1 


(76) 


“  X  ^ki(\^kj) 

k=  I 
n 

-  X  ^k^ki^kj 

k=  1 

^  j  0  if  /7t ;• 

1  if  i  =  ;■ 


..Q'AQ  =  A 


(77) 


As  the  first  step  in  the  implementation  of  the  symmetric  Schur  decomposition,  we  tridiagonalize  the 
input  matrix: 


(78) 


The  tridiagonal  matrix  T  is  derived  from  the  input  matrix  A  through  orthogonal  Householder  reflec¬ 
tions  Qj : 


X  X  X  X  X 

y  y  0  0  0 

X  X  X  X  X 

y  y  y  0  0 

X  X  X  X  X 

=> 

0  y  y  y  0 

X  X  X  X  X 

0  0  y  y  y 

X  X  X  X  X 

0  0  0  y  y 

Qt  AQj  —  T  (79) 

A  MATLAB  implementation  of  the  real  Householder  tridiagonalization,  which  overwrites  the  input 
matrix  with  the  tridiagonal  matrix  (if  Qj  is  desired,  it  must  be  separately  formed),  is  shown  below  in  Fig¬ 
ure  14. 
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for  k  =  l:{num_rows  -  2) 

%  Compute  the  Householder  vector  v. 

[v,  beta]  =  house(A({k  +  1) :num_rows,  k) ) ; 

%  Apply  the  Householder  reflection. 

p  =  beta'  *  A((k  +  1)  :nuin_rows,  (k  +  1)  :num_rows)  *  v; 
w  =  p  -  (beta  *v*v’  *p/2); 

A{(k  +1),  k)  =  -norm(A{(k  +  1) :num_rows,  k) ) ; 

A(k,  (k  +  D)  =  A{{k  +  1),  k); 

A((k  +  1)  :nuiTL.rows,  (k  +  1)  :niim_rows)  ... 

=  A(  (k  +  1)  :num_rows,  (k  +  1)  :num_rows)  -  v*w’  -  w*v'; 

%  Zero  out  remainder  of  row  and  column. 

A((k  +  2) :num_rows,  k)  =  0; 

A{k,  (k  +  2):num_rows)  =  0; 
end  %  for  k 

%  Apply  phase-only  correction  to  last  super-  and  sub-diagonal  elements  to 
%  make  them  real 

last_super_diag  =  A({num_rows  -  1),  num_rows); 

A  (  ( nuit\_rows  -  1),  num_rows)  =  abs  (last_super_diag)  ; 

A{num_rows,  {num_rows  -  1))  =  abs  { last_super_diag)  ; 


Figure  14:  MATLAB  code  for  a  real  Householder  tridiagonalization  ([4]) 

In  computing  the  flop  count  for  a  real  Householder  tridiagonalization,  we  consider  the  application  of 
the  Householder  reflections  only:  the  flop  count  for  the  computation  of  the  Householder  vector  v  is 
ignored. 

For  iteration  k,  computing  the  product  of  A(Jt+  1  :num_rows,  A:  +  l:num_rows)  and  v  requires 

2(n  -  k)  flop.  Scaling  this  product  by  P  to  compute  p  requires  n-k  flop  and  is  disregarded.  Computing 
w  requires  3(n-k)  +  2  is  also  disregarded.  Computing  ||/1(^+  1  :num_rows,  ^:)|l2  requires  2(n-k)  flop 
and  is  also  disregarded. 

7*  2  7*  T 

Computing  vw  requires  (n-k)  flop.  The  product  wv  is  the  transpose  of  vw  and  requires  no 

T  T 

additional  computations.  Taking  advantage  of  the  symmetry  of  the  output,  subtracting  both  vw  and  wv 

2  2 

from  A  requires  (n  -  k)  flop.  The  total  flop  count  for  iteration  k  is  4(n  -  k)  flop. 

The  total  flop  count  for  the  entire  n  x  n  matrix  can  be  computed  as  follows; 
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(80) 


n-2 

flop  count  =  ^  4(n  -  k)^ 

k=  I 


^n-2  n-2  n-2  ^ 

n^-2j^nk  + 

Vit  =  1  k=r  I  *  =  1  y 

f  n-2  n-2  n-2  \ 


=  4x 


=  4x 


n'^l-2nY,k+  X 


|[n^(n-2)]-^2n 


(n-2)(n-l) 


(n-2)(n-l)(2n-3) 

6 


_  Sn-\2n^  +  4n-24 
6 

We  then  drop  the  lower  order  terms  to  arrive  at  the  canonical  estimated  flop  count  for  real  matrices: 

flop  count  =  I  (81) 

To  evaluate  the  eigenvectors  in  addition  to  the  eigenvalues,  we  will  need  to  accumulate  the  House¬ 
holder  reflections  in  fir  •  product  of  the  Householder  reflection  matrices  is  equal  to 


Qt  -  Px...P„_2  (82) 

where  each  matrix  P^  is  the  Householder  reflection  matrix  for  loop  index  k  in  the  algorithm  given  in  Fig¬ 
ure  14  above.  Each  matrix  Pj  has  the  form 


Pk  = 


h  0  it 

0  P*  n-k 

k  n-k 


where  /*  is  the  ifc  x  A:  identity  matrix  and 

h=In-k-^^y 


(83) 


(84) 


for  iteration  k . 
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Noting  that  the  P*  portion  of  shrinks  as  k  increases,  we  can  reduce  the  workload  needed  to  com¬ 
pute  Qf  if  we  accumulate  this  product  starting  with  P„_2  Householder  transformation  matrix  with 
the  smallest  non- identity  submatrix)  rather  than  starting  with  P^  (the  Householder  transformation  matrix 

with  the  largest  non-identity  submatrix):  instead  of  accumulating  a  product  that  is  virtually  completely 
non-identity  from  the  beginning,  we  will  slowly  grow  the  submatrix  that  is  non-identity  ([6]). 

Let 

n-2 

e,  =  IT'’- 

l=Jt 

Then 

Qk-\  -  f^k-\Qk 

and 

Qt  =  Qi  (87) 

If  we  consider  the  non-identity  portion  of  the  product  1  “  Pk-\Qk^^^ 


where 


Qk^\  =  Pk-\Qk 


Qk  = 


/  0 

pQk 


(88) 


(89) 


Given  that  the  Householder  reflection  matrix  P*  _  i  is 

P*_,  =/-Pvv^  (90) 

we  can  express  the  product  in  Equation  88  thusly: 

Qk-i  =  (i-fiyy^)Qk  (91) 

=  Qk-^vv^Qk 

Let 

w  =  v^Qk  (92) 

Then 
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Qk-\  =  Qk-^vw 


(93) 


j* _  2 

Computing  the  product  w  =  v  Qi^  requires  2(n  -  k)  flop.  Scaling  w  by  P  requires  n-k  flop  and 

2  2 

is  disregarded.  Computing  pvw  requires  (n  -  k)  flop.  Computing  Q*  “  pvw  also  requires  (n-k)  ,  for  a 
total  of  4(n  -  k)  flop  for  iteration  k .  The  total  flop  count  to  accumulate  Qj-  for  a  real  input  matrix  is 


4  3 

flop  count  =  -n  (94) 

The  rest  of  the  work  in  the  symmetric  Schur  decomposition  is  performed  in  a  series  of  implicit  sym¬ 
metric  QR  steps  with  Wilkinson  shifts  ([4]).  This  algorithm  takes  as  an  input  an  unreduced  symmetric  trid- 

iagonal  matrix  T  e  and  overwrites  it  with  the  quantity  TZ .  A  matrix  is  said  to  be  unreduced  if  it 
has  no  zero  subdiagonal  entries.  The  matrix  Z  is  equal  to  the  product  of  Givens  rotations 

Z=G,...G„_,  (95) 


T 

and  has  the  property  that  Z  {T  -  p/)  is  upper  triangular.  The  scalar  p,  is  the  eigenvalue  of  the  2-by-2  prin¬ 
cipal  submatrix  of  T  that  is  closer  to  (the  element  in  the  matrix  T  in  the  n  th  row  and  n  th  column). 


There  is  an  easy  way  to  compute  the  eigenvalues  of  a  2  x  2  matrix.  First,  we  observe  that  the  eigen¬ 
values  A.  of  a  matrix  A  satisfy  the  characteristic  polynomial  ([4]) 

det{Xl-A)  =  0  (96) 

Expanding  Equation  96: 


/'r,  1 

X  0 

°\\  «12 

,[o  x_ 

®21  ^22 

y 

=  0 


(97) 


det 


/ 

X-fli, 

1 

_ 1 

\ 

\ 

1 

X  -  022 

y 

=  0 


(98) 


The  determinant  of  a  2  x  2  matrix  is  given  by 
\ 


det\ 


a  b 
c  d 


=  ad  — be 


Therefore,  Equation  98  is  equivalent  to 


(99) 
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{X  — a^^){X  — 022)  ~  (~Oy2)i~02i)  —  0 


(100) 


A.  —  (flji  +  022)^  +  (aii^22  “ ^12^21 )  ~  ® 
Solving  for  X, ,  we  have: 


X  = 


(a^^  +022)  ^  “  ^(^11^22  “  **12*^21) 


(101) 


(102) 


The  implicit  symmetric  QR  step  with  Wilkinson  shift  uses  the  algorithm  givens(a,  fc),  which 
returns  two  scalars  c  =  cos(6)  and  s  =  sin(6)  such  that 


- 

T 

c  s 

a 

r 

-s  c 

b 

0 

(103) 


A  MATLAB  implementation  of  the  givens  algorithm  is  shown  below  in  Figure  15. 


function;  [c ,  f]  =  givens (a,  b) 

if  b  =  0 
c  =  1 

j  =  0 

else 

if  |fc|>|fl| 

T  =  -a/b 

s  =  1/71+^ 
c  =  sx 
else 

T  =  ^b/a 

c  =  i/ViT? 

s  =  cx 

end  %  if 
end  %  if 


Figure  15;  MATLAB  code  for  the  givens  algorithm  ([4]) 

This  algorithm  requires  five  flop  and  a  single  square  root. 

A  shorthand  notation  is  often  used  for  a  real  Givens  rotation  matrix; 
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G(i.k,e)  = 


(104) 


1  ...  0  ...  0  ...  0 

0  ...  c  ...  s  ...  0 

0  ...  -s  ...  c  ...  0 

0  ...  0  ...  0  ...  I 

/  k 


When  applying  a  Givens  rotation,  we  do  not  explicitly  form  this  matrix;  instead,  we  take  advantage 
of  the  structure  in  the  Givens  rotation  matrix. 

Let  A  €  'R””'",  c  =  cos(0),  and  s  =  sin (6).  Then,  the  update  A  «—  G(i,  k,  0)^ A  affects  just 


rows  /  and  k  of  A: 

>4([i, /:],:)  = 


IT 


c  s 
—s  c 


A([i,A:],:) 


(105) 


A  MATLAB  implementation  of  this  update  is  shown  below  in  Figure  16. 


This  update  requires  only  6n  flop. 

Similarly,  the  update  A  «—  AG(i,  k,  0)  affects  only  columns  i  and  k  of  A: 


A(:,[/U])  =  A(:,[/,^J) 

A  MATLAB  implementation  of  this  update  is  shown  below  in  Figure  17. 


c  s 
-s  c 


(106) 
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This  update  also  requires  only  6n  flop. 

A  MATLAB  implementation  of  the  implicit  symmetric  QR  step  with  Wilkinson  shift  is  shown  below 
in  Figure  1 8. 


Figure  18:  MATLAB  code  for  an  implicit  symmetric  QR  step  with  Wilkinson  shift  ([4]) 


The  bulk  of  the  workload  is  found  in  the  for  loop.  For  each  k,  we  need  to  perform; 

•  five  flop  and  one  square  root  for  the  givens  algorithm 

T 

•  27  flop  to  compute  G*  T G*  (see  below) 
for  a  total  of  32  flop  and  one  square  root. 

The  flop  count  for  applying  the  Givens  rotation  G^  to  T  is  only  27  and  independent  of  the  matrix 
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T 

size  n  because  we  can  take  advantage  of  the  fact  that,  within  the  for  k  loop,  T Gj^  remains  symmetric, 

T 

and  that  most  of  the  elements  in  rows  k  and  il:  +  1  are  zero.  Consider  the  premultiplication  of  T  by 
(see  Figure  19). 


T 

To  form  the  product  T ,  we  only  need  to  compute  b' ,  d ,  e' ,  f ,  g' ,  and  A ,  even  though  we  are 
updating  a  total  of  eight  entries:  we  don’t  need  to  compute  a  value  we  know  to  be  zero,  and  we  only  need  to 
compute  e'  once.  Computing  these  scalars  requires  three  flop  per  scalar  (to  compute  either  ex,  -jt,  or 

5T|  +  cTj ).  for  a  total  of  18  flop. 

T 

We  can  similarly  take  advantage  of  the  structure  in  the  matrices  to  efficiently  postmultiply  G^  T  by 
Cj  (.see  Figure  20). 


T 

To  compute  the  product  G^  TG/^,  we  only  need  to  compute  d’ ,  e" ,  and  /" ,  even  though  we  are 
updating  eight  entries:  we  don’t  need  to  compute  a  value  we  know  to  be  zero,  and  we  can  take  advantage  of 
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T 

the  symmetry  in  G/^  T G/^  to  avoid  recomputing  b' ,  e” ,  g' ,  and  A .  Computing  these  scalars  requires  three 
flop  per  scalar,  for  a  total  of  nine  flop. 

For  all  k  ranging  from  1  to  n  -  1,  we  will  need  to  perform  32(n  -  1 )  flop  and  n  -  1  square  roots, 
which  we  round  to  30n  flop  and  n  square  roots  to  match  the  text  ([4]). 

If  we  want  to  accumulate  the  Givens  rotations  by  updating  an  input  orthogonal  matrix  Q  with 
QG^...G„_^ ,  we  will  require  6n  flop  to  apply  each  Givens  rotation  to  the  running  product,  for  a  total  of 

2 

6/i(/i  -  1 ) ,  or  approximately  6n  flop. 

A  MATLAB  pseudo-code  implementation  for  the  overall  algorithm  for  computing  the  symmetric 
Schur  decomposition  of  a  real  matrix  A  ,  which  overwrites  A  with  the  tridiagonal  matrix  T,  is  given  below 
in  Figure  21  (tol  is  a  tolerance  greater  than  the  unit  roundoff). 
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%  Tridiagonalize  the  input  matrix. 

Use  the  algorithm  in  Figure  14  to  compute  the  tridiagonalization 

Set  D  =  T  and,  if  Q  is  desired,  form  Q  = 

\mtil  q  =  n 

for  i  =  l:n  -  1 

end  %  if 
end  %  for  i 

Find  the  largest  q  and  the  smallest  p  such  that,  if 

0  0 

D22  0 

0  D33 

n-^p-^q  q 

is  diagonal  and  D22  is  unreduced 

Use  the  algorithm  in  Figure  18  to  update  0^2  • 

=  diag{/p.  2. 1/0^2  diag(/p.  2.  /,) 

If  Q  is  desired,  then  update  Q  : 

Q  =  Gdiag(/^,Z,/^) 
end  %  if 

end  %  until  q  =  n 

Figure  21 :  MATLAB  pseudo-code  for  a  real  symmetric  Schur  decomposition  ([4]) 


P 

n-p-^q 

9 


D  = 


^11 

0 

0 


then  D 
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if  q<n 


The  matrix  is  the  n  x  n  identity  matrix. 

The  computational  workload  for  the  real  symmetric  Schur  decomposition  is  found  principally  in  the 

4  3 

tridiagonalization  of  the  input  matrix,  which  requires  approximately  -n  flop.  There  are  0{n)  calls  to  the 
implicit  symmetric  QR  step  with  Wilkinson  Shift,  each  with  an  0(n)  flop  count.  Therefore,  the  flop  count 
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2 

for  the  implicit  symmetric  QR  steps  will  be  0{n  ) .  As  it  is  an  order  smaller  than  the  flop  count  for  the  trid- 
iagonalization,  this  flop  count  is  ignored  in  the  overall  flop  count,  giving  us  the  canonical  flop  count  for  the 
real  symmetric  Schur  decomposition: 

4  3 

flop  count  =  -n  (1 07) 

It  should  be  noted  here  that  the  actual  number  of  iterations  through  the  algorithm  in  Figure  21 
needed  to  converge  on  a  solution  is  not  deterministic.  We  estimate  this  number  of  iterations  to  be  approxi¬ 
mately  n ,  but  it  may  be  several  times  larger  than  n . 

If  we  want  to  accumulate  the  orthogonal  transformations  Q ,  we  will  need  to: 

•  accumulate  the  Householder  reflections  during  the  tridiagonalization  of  the  input  matrix,  at  a 

4  3 

cost  of  -n  flop 

•  accumulate  the  Givens  rotations  during  each  of  the  0(n)  implicit  symmetric  QR  steps  at  a 

2  3 

cost  of  6n  flop  per  iteration,  for  a  total  of  approximately  6n  flop 

If  we  add  to  these  two  items  to  the  workload  for  the  symmetric  Schur  algorithm  without  accumulat¬ 
ing  Q ,  we  can  compute  the  total  workload: 

flop  count  =  |/i^  +  6n^  +  |n^  (108) 

26  3 

We  round  this  figure  to  arrive  at  the  canonical  workload  for  the  symmetric  Schur  algorithm  if  we 
accumulate  Q : 

flop  count  =  9/1^  (1 09) 

6,2  COMPLEX  EIGENVALUE  DECOMPOSITION 

The  symmetric  Schur  decomposition  of  a  complex  symmetric  matrix  A  €  C"  ^  "  computes  a  unitary 
matrix  Qe  such  that: 

=  A  =  diag(X„...,X„)  (110) 

This  decomposition  of  A  results  in  the  eigenvectors  being  the  columns  of  Q  and  the  corresponding 
eigenvalues  being  the  diagonal  elements  of  A . 
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As  the  first  step  in  the  implementation  of  the  symmetric  Schur  decomposition,  we  tridiagonalize  the 
input  matrix.  The  tridiagonal  matrix  T  is  derived  from  the  input  matrix  A  through  unitary  Householder 
reflections  Qj : 

Qj^AQj  =  T  (111) 

A  MATLAB  implementation  of  the  complex  Householder  tridiagonalization,  which  overwrites  the 
input  matrix  with  the  tridiagonal  matrix  (if  is  desired,  it  must  be  separately  formed),  is  shown  below  in 
Figure  22. 


for  k  =  l:nuin_rows  -  2 

%  Compute  the  Householder  vector  v. 

X  =  A(col :num_rows,  col); 

V  =  x±e^M2^]  ;  where  Xj  =  re‘^ 

beta  =21  (v'  *  v) ; 

%  Apply  the  Householder  reflection, 
p  =  beta  *  A(k  +  l:n,  k  +  l:n)  *  v; 
w  =  p  -  (beta  *p'  *v/2)  *v; 

A(k  +  1,  k)  =  nonn(A(k  +  l:num_rows,  k)  )  ; 

A(k,  k  +  1)  =  A(k  +  1,  k)  ; 

A  ( k  +  1 :  nunurows ,  k  +  1 :  nuiturows )  ... 

=  A ( k  +  1 : num_rows ,  k  +  1 : num_rows )  -  v*w'  -  w*v'; 
end  %  for  k 

Figure  22:  MATLAB  code  for  a  complex  Householder  tridiagonalization 

Even  though  the  input  matrix  A  is  complex,  the  resultant  tridiagonal  matrix  T  is  real. 

In  computing  the  flop  count  for  a  complex  Householder  tridiagonalization,  we  consider  the  applica¬ 
tion  of  the  Householder  reflections  only:  the  flop  count  for  the  computation  of  the  Householder  vector  v  is 
ignored. 

For  iteration  computing  the  product  of  A(A:+  1  :num_rows, /:  +  l:num_rows)  and  v  requires 

S(n-k)  flop.  Scaling  this  product  by  P  to  compute  p  requires  2(n-k)  flop  and  is  disregarded.  Com¬ 
puting  w  requires  l2{n-k)  +  3  is  also  disregarded.  Computing  ||A(A  +  1  :num_rows,  ^)||2  requires 
4(n  -  k)  flop  and  is  also  disregarded. 

Computing  vw  requires  6(n  -  k)  flop.  The  product  wv  is  the  transpose  of  vw  and  requires  no 
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M  H 

additional  computations.  Taking  advantage  of  the  symmetry  of  the  output,  subtracting  both  vw  and  wv 

2  2 
from  A  requires  2(n-k)  flop.  The  total  flop  count  for  iteration  k  is  16(n  -k)  flop,  or  four  times  the 

flop  count  for  the  real  Householder  tridiagonalization.  The  total  estimated  flop  count  for  the  complex 

Householder  tridiagonalization  is  therefore 

flop  count  =  (112) 

To  evaluate  the  eigenvectors  in  addition  to  the  eigenvalues,  we  will  need  to  accumulate  the  House¬ 
holder  reflections  in  Qj- .  The  product  of  the  Householder  reflection  matrices  is  equal  to 

er  =  ^i-^-2  (113) 

where  each  matrix  is  the  Householder  reflection  matrix  for  the  loop  index  k  in  the  algorithm  given  in 
Figure  22  above.  We  use  the  same  technique  for  accumulating  Qj  that  was  used  for  the  real  Householder 
tridiagonalization.  For  each  iteration  k ,  we  are  computing  the  quantity 

Qk-\=Qk-^vw  (114) 

where 

w  =  (115) 

Computing  the  product  w  =  v  Q*  requires  %{n-k)  flop.  Scaling  w  by  P  requires  2{n  —  k)  flop 

and  is  disregarded.  Computing  pvw  requires  6(n  -  k)  flop.  Computing  2*  -  P^w  requires  2(/j  -  k)  , 

for  a  total  of  16(n  -k)  flop  for  iteration  k ,  or  four  times  the  flop  count  for  the  real  Householder  tridiago¬ 
nalization.  The  total  flop  count  to  accumulate  Qj-  for  a  complex  input  matrix  is 

flop  count  -  "y  (116) 

The  rest  of  the  work  in  the  symmetric  Schur  decomposition  is  performed  in  a  series  of  implicit  sym¬ 
metric  QR  steps  with  Wilkinson  shifts.  This  algorithm  takes  as  an  input  an  unreduced  symmetric  tridiago¬ 
nal  matrix  T  e  and  overwrites  it  with  the  quantity  Z^TZ.  The  matrix  Z  is  equal  to  the  product  of 

Givens  rotations 

Z=G,...C„_,  (117) 

and  has  the  property  that  Z^(7’  -  p/)  is  upper  triangular.  Because  the  tridiagonal  matrix  T  is  real,  the  Giv¬ 
ens  rotations  G,-  are  also  real,  and,  therefore,  the  algorithm  for  implicit  symmetric  QR  steps  with  Wilkin- 
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son  shifts  that  was  given  in  Figure  18  may  be  used  here  ([9]).  The  woridoad  for  this  algorithm,  as  was 
previously  indicated,  is  approximately  30n  flop  and  n  square  roots. 

If  we  want  to  accumulate  the  Givens  rotations  by  updating  an  input  unitary  matrix  Q  with 
2G I . . .  G„  _  1 ,  we  will  require  6n  flop  to  apply  each  Givens  rotation  to  the  running  product  for  each  of  the 

2 

real  and  imaginary  halves  of  (2 .  for  a  total  of  12n(n  -  1 ) ,  or  approximately  12n  flop. 

A  MATLAB  pseudo-code  implementation  for  the  overall  algorithm  for  computing  the  symmetric 
Schur  decomposition  of  a  complex  matrix  A ,  which  overwrites  A  with  the  tridiagonal  matrix  T,  is  given 
below  in  Figure  23  (tol  is  a  tolerance  greater  than  the  unit  roundoff). 
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%  Tridiagonalize  the  input  matrix. 

Use  the  algorithm  in  Figure  22  to  compute  the  tridiagonalization 

Set  D^T  and,  if  Q  is  desired,  form  Q  -  P^.,,P^_2 

until  q  =  n 

for  i  =  l:n  -  1 

if  K+I./I  =  K./*i|^«»x(K,i  +  K*i.,>i|) 

^1+1,1  ” 

=  0; 

end  %  if 
end  %  for  i 

Find  the  largest  q  and  the  smallest  p  such  that,  if 

P 

n-p-q 
R 


and  D22  is  unreduced 


Use  the  algorithm  in  Figure  18  to  update  0^2  ' 

Dj2  =  diag(/p,  2,  1^/d22  diag(/p,  Z,  /^) 

If  Q  is  desired,  then  update  Q  : 

Q=  (2diag(/p.2.y 
end  %  if 

end  %  until  q  =  n 

Figure  23:  MATLAB  pseudo-code  for  a  complex  symmetric  Schur  decomposition 


D  = 


D 


72 


0 

0 

^33 


then  D 


33 


n-p^q  q 
is  diagonal 


if  q<n 


The  computational  workload  for  the  complex  symmetric  Schur  decomposition  is  found  principally 
in  the  tridiagonalization  of  the  input  matrix,  which  requires  approximately  flop.  There  are  0{n) 

calls  to  the  implicit  symmetric  QR  step  with  Wilkinson  Shift,  each  with  an  0(n)  flop  count.  Therefore,  the 
flop  count  for  the  implicit  symmetric  QR  steps  will  be  0(n  ).  As  it  is  an  order  smaller  than  the  flop  count 
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for  the  tridiagonalization,  this  flop  count  is  ignored  in  the  overall  flop  count,  giving  us  the  flop  count  for  the 
complex  symmetric  Schur  decomposition: 

flop  count  =  -y  (118) 

If  we  want  to  accumulate  the  orthogonal  transformations  Q ,  we  will  need: 

•  accumulate  the  Householder  reflections  during  the  tridiagonalization  of  the  input  matrix,  at  a 
cost  of  flop 

•  accumulate  the  Givens  rotations  during  each  of  the  0{n)  implicit  symmetric  QR  steps  at  a 

2  3 

cost  of  12/1  flop  per  iteration,  for  a  total  of  approximately  \2n  flop 

If  we  add  to  these  two  items  the  workload  for  the  symmetric  Schur  algorithm  without  accumulating 
Q ,  we  can  compute  the  total  workload: 

flop  count  =  ^/i^  +  12/1^  +  (119) 

68  3 

=  T” 

We  round  this  figure  to  arrive  at  the  workload  for  the  symmetric  Schur  algorithm  if  we  accumulate 

Q- 

flop  count  =  23/1^  (1 20) 
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7.  SINGULAR  VALUE  DECOMPOSITION 


7.1  REAL  SINGULAR  VALUE  DECOMPOSITION 

The  singular  value  decomposition  (S VD)  of  a  real  matrix  A  e  'iC'  ^ " ,  where  m>n,  computes 
orthogonal  matrices  U  e  and  V  e  such  that  ([4]): 

U^AV  =  diag(o„a2....,a„)  (121) 

where 

a,  (122) 

The  scalars  a,  are  the  singular  values  of  A  .  The  matrix  U  contains  the  left  singular  vectors,  while 
the  matrix  V  contains  the  right  singular  vectors. 

As  the  first  step  in  the  implementation  of  the  SVD  algorithm,  we  upper  bidiagonalize  the  input 
matrix: 


X  X  X  X  X 

y  y  0  0  0 

X  X  X  X  X 

0  y  y  0  0 

X  X  X  X  X 

0  0  y  y  0 

X  X  X  X  X 

0  0  0  y  y 

X  X  X  X  X 

0  0  0  0  y 

X  X  X  X  X 

0  0  0  0  0 

X  X  X  X  x^ 

0  0  0  0  0 

The  nxn  bidiagonal  matrix  B  is  derived  from  the  input  matrix  A  through  the  application  of  orthog¬ 
onal  Householder  reflections  U g  and  Vg : 


=  Us^AVg  (124) 

A  MATLAB  implementation  of  the  real  Householder  bidiagonalization,  which  overwrites  the  input 
matrix  with  the  bidiagonal  matrix  (if  Ug  and  Vg  are  desired,  they  must  be  separately  formed),  is  shown 
below  in  Figure  24. 


B 

0 
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for  j  =  l:nuin_cols 

%  Compute  the  Householder  vector  v. 

[v,  beta]  =  house (A( j :num_rows,  j)); 

%  Apply  the  Householder  reflection  (premultiply) . 
w  =  beta  *  v‘  *  A ( j  : n\im_rows ,  j:num_cols); 

A ( j : num_rows ,  j:num_cols)  ... 

=  A { j  : num_rows ,  j:num_cols)  -  v  *  w; 

%  Zero  out  the  rest  of  the  column. 

A ( ( j  +  1 ) : num_rows ,  j )  =  0 ; 

if  (j  <=  {num_cols  -  2)) 

%  Compute  the  Householder  vector  v. 

[v,  beta]  =  house((A(j,  (j  +  1 )  :nuiTUCols )  )  '  )  ; 

%  Apply  the  Householder  reflection  (postmultiply) . 
w  =  beta*  *  A ( j  : nuiTurows ,  (j  +  1)  :num_cols)  *  v; 

A  { j  :  nxim_rows ,  (j  +  1)  :num_cols)  ... 

=  A ( j : num_rows ,  (j  +  1) :num_cols)  -  w  *  v*  ; 

%  Zero  out  the  rest  of  the  row. 

A(j,  j  +  2:num_cols)  =  0; 
end  %  if  (j  <=  (num_cols  -  2)) 
end  %  for  j 

%  Apply  phase-only  correction  to  last  super-diagonal  element  to  make  it  real 
last_super_diag  =  A((num_cols  -  1),  num_cols) ; 

A((nuirucols  -  1),  num_cols)  =  abs  (last_super_diag)  ; 


Figure  24:  MATLAB  code  for  a  Householder  bidiagonalization  ([4]) 

In  computing  the  flop  count  for  a  real  Householder  bidiagonalization,  we  consider  the  application  of 
the  Householder  reflections  only:  the  flop  count  for  the  computation  of  the  Householder  vector  v  is 
ignored. 

For  iteration  y,  computing  the  product  of  the  transpose  of  A(y:num_rows,  jf:num_cols)  and  v 
requires  2{m  -  j){n  -  j)  flop.  Scaling  this  product  by  p  to  compute  w  requires  m  -  y  flop  and  is  disre- 
T  T 

garded.  Computing  vw  requires  {nt-j){n-j)  flop.  Subtracting  vw  from  A  requires  (m-  j)(n-  j) 
flop. 
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If  j  <  num_coIs  -  2 ,  there  are  additional  computations.  Computing  the  product  of 
A(y:num_rows,  j  +  1  :num_cols)  and  v  requires  2(m  -  j){n  -  j-\).  Scaling  this  product  by  P  requires 

T  T 

n-  j~  \  flop  and  is  disregarded.  Computing  wv  requires  (m  -  j){n  -  j-\).  Subtracting  wv  from  A 
requires  (m  -  j){n  -  j  -\).  The  total  flop  count  for  iteration  j  is  approximately  8(/n  -  y)(n  -  j) . 

The  total  flop  count  for  the  entire  mxn  can  be  computed  as  follows: 

n 

flop  count  =  ^  8(m  -  j)(n  -  j)  (125) 

y=i 


=  8  ^  [mn  -  (m  +  n)j  +  /] 


J  = 


j=\  j=\  j=\ 

=  8mn^  -  8(m  -K 

2  o 


=  Smn^  -4mn^  —  4n^  +  j(2n^  +  3n^  +  n) 


=  4nin^  -  4n  + 

3  3 


3  3 


We  then  drop  the  lower  order  terms  to  arrive  at  the  canonical  estimated  flop  count  for  real  matrices: 


flop 


count  =  4mn 


2 


(126) 


To  compute  the  left  and  right  singular  vectors,  we  will  need  to  accumulate  the  Householder  reflec¬ 
tions  in  Ug  and  V g .  The  bidiagonal  matrix  B  is  computed  from  the  input  matrix  A  through  the  following 
Householder  reflections: 


^  ~  QpK.n’'’Qpn,\^Q\ 


5  post,  I 


-post,  n-2 


(127) 


where  Qp^  j  is  the  Householder  reflection  matrix  (by  which  we  premultiply  A  )  used  to  zero  out  column  i , 
and  Cpost.  i  is  the  Householder  reflection  matrix  (by  which  we  postmultiply  A )  used  to  zero  out  row  i ,  and 
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n  is  the  number  of  columns  in  A .  We  accumulate  all  of  the  j  reflection  matrices  in  Ug,  and  all  of  the 
2post,i  reflection  matrices  in  Vg: 

-  i2pte,n"-i2pre,  1  O'"  -  ^pte,  I  •  •  •  Gpre. n  (128) 

V'i,  =  GposU-epost.n-2  (129) 

First,  consider  the  computation  of  Ug.  Each  matrix  Qp^j  €  (R.'"’'"*  has  the  form 


j 

m-j  (130) 

j  m-j 

where  Ij  is  the  j  x  J  identity  matrix  and 

=  (131) 

for  iteration  j . 

Noting  that  the  Qprc.j  portion  of  Gprej  shrinks  as  j  increases,  we  can  reduce  the  workload  neces¬ 
sary  to  compute  Ug  if  we  accumulate  this  product  starting  with  Cpre,  n  (^®  Householder  transformation 


matrix  with  the  smallest  non-identity  submatrix)  rather  than  starting  with  j  (the  Householder  trans¬ 

formation  with  the  largest  non-identity  submatrix):  instead  of  accumulating  a  product  that  is  almost  com¬ 
pletely  non-identity  from  the  beginning,  we  will  slowly  grow  the  submatrix  that  is  non-identity. 

Let 

n 

*  =  y 

(132) 

Then 

p  .  -  n  .  p 

prc.j*J  ^pre,j-M  pre,j 

(133) 

and 

Ub  =  V.1 

(134) 

If  we  consider  the  non-identity  portion  of  the  product  Ppn,}-i  ~  Gpre,  j-i^pre,  j » 


'Pre.J 


0 

Gpre,jJ 
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(135) 


^pre.J-l  ~  2pre,j-l'^| 


pre.j 


where 


P 


pre.j 


0  P 


0 

Pre.i 


(136) 


Given  that  the  Householder  reflection  matrix  Cpre,  j-i  is 

ep,r.j-i  =  /-Pvv"^  (137) 

we  can  express  the  product  in  Equation  135  thusly: 

?prc.j-l  =  (/-Pvv'')Pp„,j  (138) 

~  •^pte.j~P''^  ^ pre.j 

Let 

w  =  (139) 

Then 

?pre.j-l  =  V.j-PVH'  (140) 

Computing  the  product  >v  =  v  /’pre.j  requires  2(m  -  j)  flop.  Scaling  w  by  P  requires  m-  j  flop 

and  is  disregarded.  Computing  pvw  requires  (m  -  j)  flop.  Computing  /’pre.j  “  P‘'^  also  requires 

2  2 
(m  -  jf)  flop,  for  a  total  of  4(m  -  j)  flop  for  iteration  j . 

The  total  flop  count  to  accumulate  can  be  computed  as  follows: 
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flop  count  =  ^  4(m  -  j)^ 
y=  1 

n 

=  4£(m^-2mj  +  /) 
j=  1 


(141) 


.2 


=  4m  n  -  8m  y  .;  +  4  / 

7=1  7=1 

=  4m'»  -  -H  )(2n  ti.) 

2  6 


=  4m^n  -  4mn^  -  4mn  +  +  2n^  +  ~n 

3  3 


We  then  drop  the  lower  order  terms  to  arrive  at  the  estimated  flop  count  for  accumulating  U g  for  real 


matrices*: 


2  2  4-3 

flop  count  =  4m  n  -  4mn  +  -n 


(142) 


We  can  use  a  similar  process  to  compute  Vg.By  accumulating  the  product  Vg  =  Gp<,s,  | . . .  Qp^st.  n-2 
from  Qpoit,  n-2  progressing  backward,  we  can  minimize  the  workload  for  computing  V g . 

Let 

n-2 

=  nei-.u  <’«> 


'  =  7 


Then 


and 


^pOSl,j-l  2poSt,j- Impost,  j 


-  ^posi.  1 


(144) 


(145) 


If  we  consider  the  non-identity  portion  of  the  product  =  Gpost.  j-i  ^post,  j  * '''® 

^post,j-l  ~  Gpost j-l^posi,j  (146) 


1.  This  estimated  flop  count  differs  from  those  in  the  literature.  See  Appendix  A  for  more  details. 
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where 


P 


post,  j 


/  0 

P  v..i 


(147) 


Given  that  the  Householder  reflection  matrix  Gpost,j-i  ‘S 

gpostj-i  =  /-Pvv"^  (148) 

we  can  express  the  product  in  Equation  146  thusly: 

?post,j-l  =  (/-pvv^)Ppo,,j  (149) 

~  ^post,  j  “  ^post,j 

Let 

w'  =  v^^postj  (150) 

Then 

^post,j-l  ~  ^post,j  ~  P'^*^  (151) 

Computing  the  product  w  =  v  Ppost,  j  requires  2{n-j)  flop.  Scaling  w  by  P  requires  n-j  flop 

and  is  disregarded.  Computing  Pvw  requires  (« -  j)  flop.  Computing  Ppost.j  -  also  requires 

2  2 
(n  -  J)  flop,  for  a  total  of  4(n  -  J)  flop  for  iteration  J . 

The  total  flop  count  to  accumulate  Vg  can  be  computed  as  follows: 
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(152) 


71-2 


•  x2 


flop  count  =  ^  4(n  -  j} 
y=i 


71-2 


=  4  5;(n''-2n;  +  /) 
7=1 


/I  -2  n  -2 


.2 


=  4n  (n-2)-Sn'^j  +  4^j 

7=1  7=1 

=  4n' - 8n' - +  4(n-2)(n- l)(2n-3_) 
2  6 

=  4n^  —  8n^-  (4n^  -  \  2n^  +  8n)  +  -  9n^  +  13n  -  6) 


4  3  -  2  2  . 

=  -n  -2n  +-n-4 


We  then  drop  the  lower  order  terms  to  arrive  at  the  canonical  estimated  flop  count  for  accumulating  Vg  for 
real  matrices: 

flop  count  =  (153) 

The  rest  of  the  work  in  the  SVD  is  performed  in  a  series  of  Golub- Kahan  SVD  steps  ([4]).  This  algo¬ 
rithm  takes  as  an  input  a  bidiagonal  matrix  B  e  that  has  no  zeros  on  its  diagonal  or  superdiagonal^ 

and  overwrites  it  with  the  bidiagonal  matrix  B  =  U^BV ,  where  U  and  V  are  orthogonal.  A  MATLAB 
pseudo-code  implementation  of  the  Golub-Kahan  SVD  step  is  shown  below  in  Figure  25. 


2.  The  diagonal  elements  of  a  matrix  are  those  elements  whose  row  and  column  indices  are  equal;  the  superdiagonal 
elements  of  a  matrix  are  those  elements  whose  column  index  is  exactly  one  greater  than  its  row  index. 
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Let  \i  be  the  eigenvalue  of  the  trailing  2-by-2  submatrix  of  7  =  B  that 
is  closer  to 


z  =  r,2 

for  k  =  l:n  -  1 


Determine  c  =  cos(0)  cind  s  =  sin(0) 


such  that 


C  S 

s  c 


=  [* *o] 


Bi-BG(k,k-^\.B) 
y  =  ^kk 

^  =  ^k+  I.* 

Determine  c  =  cos(0)  and  s  =  sin(0)  such  that 


- 

T 

-- 

c  s 

y 

-5  C 

_z_ 

B<-d{k,k+\.e/B 
if  (k  <  n  -  1) 
>’  =  h.k*i 
Z  =  ^k.k  +  2 
end  %  if 
end  %  for  k 


0 


Figure  25:  MATLAB  pseudo-code  for  the  Golub-Kahan  SVD  step  ([4]) 


We  can  determine  c  =  cos (6)  and  s  =  sin(0)  by  using  the  givens  algorithm  shown  in  Figure  15. 

T 

Also,  we  can  update  B  with  either  BG(k,  it  +  1,  0)  or  G(k,  /:  +  1,  9)  B  without  explicitly  forming  the 
rotation  matrix  by  using  the  algorithms  shown  in  Figure  17  or  Figure  16,  respectively. 

The  bulk  of  the  workload  is  found  in  the  for  loop.  For  each  k,  we  need  to  perform: 

•  ten  flop  and  two  square  roots  for  two  calls  to  the  givens  algorithm 

•  12  flop  to  postmultiply  B  by  G(k,  k+  1,0) 

•  12  flop  to  premultiply  B  by  G(k,  it  +  1,  0)^ 
for  a  total  of  34  flop  and  two  square  roots. 

For  all  k  ranging  from  1  to  n  -  1,  we  will  need  to  perform  34(n  -  1 )  flop  and  2(n  -  1 )  square 
roots,  which  we  round  to  30n  flop  and  2n  square  roots  to  match  the  text  [4]. 

If  we  want  to  accumulate  the  Givens  rotations  by  updating  input  orthogonal  matrices  U  and  V  with 
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UC}\...Cjn-  \  and  VG|...G„_  1 ,  respectively,  we  will  require  6m  flop  to  apply  each  Givens  rotation  to  U 

and  6n  flop  to  apply  each  Givens  rotation  to  V .  For  all  n  -  1  iterations,  we  will  require  6m(n  -  1 )  flop,  or 

2 

approximately  6mn  flop,  to  accumulate  U ,  and  6n(n  -  1 )  flop,  or  approximately  6n  flop,  to  accumulate 
V. 

A  MATLAB  pseudo-code  implementation  of  the  overall  algorithm  for  computing  the  SVD  of  a  real 

T 

matrix  A ,  which  overwrites  A  with  its  singular  values  U  AV  =  D  +  E,  where  E  satisfies  l|£|l2  =  «l|All2 
and  u  is  the  unit  roundoff,  is  given  below  in  Figure  26  (e  is  a  small  multiple  of  the  unit  roundoff). 


%  Bidiagonalize  the  input  matrix. 

Use  the  algorithm  in  Figure  24  to  compute  the  bidiagonal i ration 

fl  =((/,. ..t//>4(F,...V„.2) 


until  q  =  n 

set  to  zero  if  K,>,| +  jt,,.,  ,^ ,|)  for  any  ie(l,n-l) 


find  the  largest  q  and  the  smallest  p  such  that,  if 


B  = 


0  ^22  ^ 


0  0 
( 

B 


33 


P 

n-p-q 


p  n-p-q  q 

then  ^33  is  diagonal  and  B^i  has  a  non- zero  superdiagonal 


if  q  <  n 

if  any  diagonal  entry  in  ^22  is  zero 

zero  the  superdiagonal  entry  in  the  same  row 
else 

apply  the  algorithm  in  Figure  25  to  B21 

B  diag(/p.  U,  BdiAgUp,  V,  I^) 

end  %  if  any  diagonal  entry  in  B22  zero 

end  %  if  q  <  n 
end  %  until  q  =  n 


Figure  26:  MATLAB  pseudo-code  for  a  real  SVD  ([4]) 
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The  computational  workload  for  the  real  SVD  is  found  principally  in  the  bidiagonalization  of  the 

2  4  3 

input  matrix,  which  requires  approximately  4m/i  -  -n  flop.  There  are  approximately  2n  calls  to  the 

Golub-Kahan  SVD  step  ([3]).  each  with  a  30n  flop  count.  Therefore,  the  flop  count  for  the  Golub-Kahan 
2 

SVD  steps  will  be  60n  .  As  it  is  an  order  smaller  than  the  flop  count  for  the  bidiagonalization,  this  flop 
count  is  ignored  in  the  overall  flop  count,  giving  us  the  canonical  flop  count  for  the  real  SVD: 


2  4  3 

flop  count  =  4mn  --n 


(154) 


It  should  be  noted  here  that  the  actual  number  of  iterations  through  the  algorithm  in  Figure  26 
needed  to  converge  on  a  solution  is  not  deterministic.  We  estimate  this  number  of  iterations  to  be  approxi¬ 
mately  2n ,  but  it  may  be  quite  different,  depending  on  the  input  matrix  and  £ . 

If  we  want  to  accumulate  the  left  singular  vectors  U,  we  will  need  to: 

•  accumulate  Ug  during  the  bidiagonalization  of  the  input  matrix,  at  a  cost  of 

2  2  4  3 

4m  n-4mn  +  -n 

•  accumulate  the  Givens  rotations  during  each  of  the  0{n)  Golub-Kahan  SVD  steps  at  a  cost 
of  6mn  flop  j>er  iteration.  If  we  assume  that  we  will  require  2n  Golub-Kahan  SVD  steps, 

2  3 

accumulating  the  Givens  rotations  will  require  a  total  of  approximately  12mn  flop 

If  we  add  these  two  items  to  the  workload  for  the  SVD  algorithm  without  accumulating  U,  we  can 
compute  the  canonical  expression  for  the  total  workload: 


243  2  243  2 

flop  count  =  4mn  -  +  4m  n  -  4mn  +  -n  +  1 2mn 


(155) 


2  2 
=  4m  n  +  12mn 

Similarly,  if  we  want  to  accumulate  the  right  singular  vectors  V,  we  will  need  to: 

4  3 

•  accumulate  Vg  during  the  bidiagonalization  of  the  input  matrix,  at  a  cost  of  -n 

•  accumulate  the  Givens  rotations  during  each  of  the  0(n)  Golub-Kahan  SVD  steps  at  a  cost 

2 

of  6n  flop  per  iteration.  If  we  assume  that  we  will  require  2n  Golub-Kahan  SVD  steps, 
accumulating  the  Givens  rotations  will  require  a  total  of  approximately  12n^  flop'* 


3.  This  estimated  flop  count  differs  from  those  in  the  literature.  See  Appendix  A  for  more  details. 

4.  This  estimated  flop  count  differs  from  those  in  the  literature.  See  Appendix  A  for  more  details. 
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If  we  add  these  two  items  to  the  workload  for  the  SVD  algorithm  without  accumulating  V,  we  can 
compute  the  canonical  expression  for  the  total  workload: 

flop  count  =  4mn^-|n^  +  |n^+ 12n^  (156) 

=  4mn^  +  12n^ 

Finally,  we  can  determine  the  canonical  expression  for  the  workload  for  the  SVD  algorithm  includ¬ 
ing  accumulating  both  U  and  V : 

flop  count  =  4mn^  -  +  4m^n  -  4mn^  +  +  \2mn^  +  +  I2n^  (157) 

.  2  .  2  40  3 

=  4m  n  +  12mn  + 

=  4m^rt  +  12mn^  +  13n^ 

7.2  COMPLEX  SINGULAR  VALUE  DECOMPOSITION 

The  SVD  of  a  complex  matrix  A  €  ,  where  m^n,  computes  unitary  matrices  U  e  Cf^’"  and 

V  e  such  that: 

U'^AV  =  diag(o„a2,  ...,o„)  (158) 

where 

(159) 

The  scalars  Oj  are  the  singular  values  of  A  .  The  matrix  U  contains  the  left  singular  vectors,  while 
the  matrix  V  contains  the  right  singular  vectors. 

As  the  first  step  in  the  implementation  of  the  SVD  algorithm,  we  upper  bidiagonalize  the  input 
matrix.  The  bidiagonal  matrix  B  is  derived  from  the  input  matrix  A  through  the  application  of  unitary 
Householder  reflections  Ug  and  V^: 

=  Ug'^AVg  (160) 

A  MATLAB  implementation  of  the  complex  Householder  bidiagonalization,  which  overwrites  the  input 
matrix  with  the  bidiagonal  matrix  (if  Ug  and  are  desired,  they  must  be  separately  formed),  is  the  same 


B 

0 
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as  the  implementation  for  a  real  Householder  bidiagonalization,  which  was  given  in  Figure  24.  Even 
though  the  input  matrix  A  is  complex,  the  resultant  bidiagonal  matrix  B  is  real. 

In  computing  the  flop  count  for  a  complex  Householder  bidiagonalization,  we  consider  the  applica¬ 
tion  of  the  Householder  reflections  only:  the  flop  count  for  the  computation  of  the  Householder  vector  v  is 
ignored. 

For  iteration  j,  computing  the  product  of  the  transpose  of  i4(7:num_rows,  y:num_cols)  and  v 

requires  8(m  -  j){n  -  j)  flop.  Scaling  this  product  by  P  to  compute  w  requires  2(m  -  j)  flop  and  is  dis- 

T  •  .  T 

regarded.  Computing  vw  requires  6(m  -  j){n  -  j)  flop.  Subtracting  vw  from  A  requires 

flop. 

If  i  <  num_cols  -  2 ,  there  are  additional  computations.  Computing  the  product  of 
i4(y:num_rows,  j  +  1  :num_cols)  and  v  requires  8(/n  -  j){n  -  j-\).  Scaling  this  product  by  P  requires 

2(n  -J-i)  flop  and  is  disregarded.  Computing  wv  requires  6(m  -  j)(n  -  J  -  1).  Subtracting  wv  from 
A  requires  2(m  -  j)(n  The  total  flop  count  for  iteration  J  is  approximately  32(m  -  j)(n  -  j) ,  or 

four  times  the  flop  count  for  the  real  Householder  bidiagonalization.  The  total  estimated  flop  count  for  the 
complex  Householder  bidiagonalization  is  therefore 


(161) 


To  compute  the  left  and  right  singular  vectors,  we  will  need  to  accumulate  the  Householder  reflec¬ 


tions  in  Ug  and  V^.  The  bidiagonal  matrix  B  is  computed  from  the  input  matrix  A  through  the  following 
Householder  reflections: 


(162) 


where  j  is  the  Householder  reflection  matrix  (by  which  we  premultiply  A  )  used  to  zero  out  column  i , 
and  j2post,  i  Householder  reflection  matrix  (by  which  we  postmultiply  >4 )  used  to  zero  out  row  i ,  and 
n  is  the  number  of  columns  in  A  .  We  accumulate  all  of  the  j  reflection  matrices  in  Ug,  and  all  of  the 

Qpost  i  reflection  matrices  in  V^: 


(163) 


(164) 


F.ach  matrix  Gprej  has  the  form 
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SpfC.j 


0 


J 

m-j 


(165) 


0  Qp^.i 


J  m-j 


where  Ij  is  the  j  x  j  identity  matrix  and 


(166) 


for  iteration  j . 

Noting  that  the  Qprej  portion  of  j  shrinks  as  j  increases,  we  can  reduce  the  workload  neces¬ 
sary  to  compute  if  we  accumulate  this  product  starting  with  „  (the  Householder  transformation 
matrix  with  the  smallest  non-identity  submatrix)  rather  than  starting  with  j  (the  Householder  trans¬ 
formation  with  the  largest  non-identity  submatrix):  instead  of  accumulating  a  product  that  is  virtually  com¬ 
pletely  non-identity  from  the  beginning,  we  will  slowly  grow  the  submatrix  that  is  non-identity. 


Let 


n 


(167) 


Then 


(168) 


and 


(169) 


If  we  consider  the  non-identity  portion  of  the  product  PpK,}-\  =  Cpre,  j-i^pre,  j » 6ave 

^pre,j-l  ~  0pre,j-l^pre,j 


(170) 


where 


(171) 


Given  that  the  Householder  reflection  matrix  Gpre.j-i  is 
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(172) 


H 


Gprc.j-1  =  /-Pvv 

we  can  express  the  product  in  Equation  170  thusly; 

=  (/-pVv")V.j 


(173) 


“  ^pre.j  P'^''  ^pre,j 


Let 


w  =  V  P, 


Prej 


Then 


^prej-l  ~  ^prc»j 


(174) 


(175) 


//“  2 

Computing  the  product  w  =  v  ^prcj  requires  8(m  -  j)  flop.  Scaling  w  by  P  requires  2(m  -  j) 

flop  and  is  disregarded.  Computing  pvw  requires  6(m-j)  flop.  Computing  /’pre.j“P''w  requires 

2(m-  j)  flop,  for  a  total  of  16(m-  j)  flop  for  iteration  j,  or  four  times  the  flop  count  for  the  real 
Householder  bidiagonalization.  The  estimated  flop  count  for  accumulating  U g  for  complex  matrices  is 
therefore 


flop  count  =  \6m^n  -  I6mn~  + 


(176) 


By  accumulating  the  product  Vg  =  Cpost,  i  •  •  •  Cposi,  n-2  6posi.n-2  ^”*1  progressing  backward, 

we  can  minimize  the  workload  for  computing  Vg . 

Let 

n-2 

v..j  = 


>=j 


Then 


and 


^post.j-l  Spost.j- Impost,  j 


-  Ppoa,  I 


(178) 


(179) 


If  we  consider  the  non-identity  portion  of  the  product  /’post.j-i  =  2post.j- impost,  j  •  have 
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(180) 


where 


^post.}-I  ~  Gpost.j- Impost, j 


posi.j 


I  0 
OP, 


post.j 


(181) 


Given  that  the  Householder  reflection  matrix  fiposi.  j-i  is 

fipost.j-1  =  /-pvv" 

we  can  express  the  product  in  Equation  180  thusly: 

^pos..j-l  =  (/-pvv")Pp^,j 


(182) 


(183) 


=  P, 


//ri 


post. 


j-pvv"P, 


post,  j 


Let 


w  =  V  P, 


post.j 


Then 


^post.j- 1  ^post.j  Pvw 


(184) 


(185) 


Computing  the  product  w  =  v  Ppost.j  requires  8(«-  j)  flop.  Scaling  w  by  p  requires  2(n-  j) 

flop  and  is  disregarded.  Computing  Pvw  requires  6{n-j)  flop.  Computing  Ppost.j- P''»<'  requires 
2  2 

2(n  -  j)  flop,  for  a  total  of  16(n  -  j)  flop  for  iteration  j ,  or  four  times  the  flop  count  for  the  real  House¬ 
holder  bidiagonalization.  The  total  flop  count  to  accumulate  Vg  for  complex  matrices  is  therefore 


fi  .  16  3 

flop  count  =  —n 


(186) 


The  rest  of  the  work  in  the  SVD  is  performed  in  a  series  of  Golub-Kahan  SVD  steps.  This  algorithm 
takes  as  an  input  a  bidiagonal  matrix  B  e  that  has  no  zeros  on  its  diagonal  or  superdiagonal  and 

overwrites  it  with  the  bidiagonal  matrix  P  =  C/^PV,  where  U  and  V  are  orthogonal.  Because  the  bidiag¬ 
onal  matrix  P  is  real,  the  Givens  rotations  G  are  also  real,  and,  therefore,  the  algorithm  for  the  Golub- 
Kahan  SVD  steps  that  was  given  in  Figure  25  may  be  used  here  ([9]).  The  workload  for  this  algorithm,  as 
was  previously  indicated,  is  approximately  30n  flop  and  2n  square  roots. 


If  we  want  to  accumulate  the  Givens  rotations  by  updating  input  unitary  matrices  U  and  V  with 
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U(j\...Cj„^\  and  VG,...G„_|,  respectively,  we  will  require  6m  flop  to  apply  each  Givens  rotation  to 
each  of  the  real  and  imaginary  halves  of  U ,  and  6n  flop  to  apply  each  Givens  rotation  to  the  real  and 
imaginary  halves  of  V.  For  all  n  -  1  iterations,  we  will  require  12m(n  -  1 )  flop,  or  approximately  12mn 

flop,  to  accumulate  f/,and  12/i(n-l)  flop,  or  approximately  12n  flop,  to  accumulate  V. 

A  MATLAB  pseudo-code  implementation  of  the  overall  algorithm  for  computing  the  SVD  of  a  com- 
plex  matrix  A ,  which  overwrites  A  with  its  singular  values  U  AV  =  D  +  E,  is  given  below  in  Figure  27 
(e  is  a  small  multiple  of  the  unit  roundoff). 


%  Bidiagonalize  the  input  matrix. 

Use  the  algorithm  in  Figure  24  to  compute  the  bidiagonalization 


until  q  =  n 

set  to  zero  if  |\,  +  ,| Se(|fc,.,.| +  |fc,^ ,  for  any  /e(l,n-l) 


find  the  largest  q  and  the  smallest  p  such  that,  if 


B  = 


^22 


B 


33 


P 

n-p-q 

Q 


p  n^p^q  q 

then  ^33  is  diagonal  and  B22  has  a  non- zero  superdiagonal 


if  q  <  n 

if  any  diagonal  entry  in  B22  is  zero 

zero  the  superdiagonal  entry  in  the  same  row 
else 

apply  the  algorithm  in  Figure  25  to  ^22 

B  <-  diag(/p,  U.  /,^„.,)"Bdiag(/p,  V,  /,) 

end  %  if  any  diagonal  entry  in  B22  is  zero 

end  %  if  q  <  n 
end  %  until  q  =  n 


Figure  27:  MATLAB  pseudocode  for  a  complex  SVD 
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The  computational  workload  for  the  complex  SVD  is  found  principally  in  the  bidiagonalization  of 

the  input  matrix,  which  requires  approximately  16mn^  -  flop.  There  are  approximately  2n  calls  to 

the  Golub-Kahan  SVD  step  ([3]),  each  with  a  30n  flop  count.  Therefore,  the  flop  count  for  the  Golub- 

2 

Kahan  SVD  steps  will  be  60/i  .  As  it  is  an  order  smaller  than  the  flop  count  for  the  bidiagonalization,  this 
flop  count  is  ignored  in  the  overall  flop  count,  giving  us  the  flop  count  for  the  complex  SVD: 

flop  count  =  16m/i^  —  '^n^  (187) 

If  we  want  to  accumulate  the  left  singular  vectors  U ,  we  will  need  to: 

•  accumulate  Ug  during  the  bidiagonalization  of  the  input  matrix,  at  a  cost  of 

ISrn^n  -  \6mn^  + 

•  accumulate  the  Givens  rotations  during  each  of  the  0{n)  Golub-Kahan  SVD  steps  at  a  cost 
of  12mn  flop  per  iteration.  If  we  assume  that  we  will  require  2n  Golub-Kahan  SVD  steps, 

2 

accumulating  the  Givens  rotations  will  require  a  total  of  approximately  24mn  flop 

If  we  add  these  two  items  to  the  workload  for  the  SVD  algorithm  without  accumulating  U ,  we  can 
compute  the  total  workload: 

flop  count  =  \6mn^  +  I6m^n- I6mn^  +  +  24mn^  (188) 

=  I6m^n  +  24mn^ 

Similarly,  if  we  want  to  accumulate  the  right  singular  vectors  V ,  we  will  need  to: 

•  accumulate  V g  during  the  bidiagonalization  of  the  input  matrix,  at  a  cost  of  •y 

•  accumulate  the  Givens  rotations  during  each  of  the  0{n)  Golub-Kahan  SVD  steps  at  a  cost 

2 

of  1 2n  flop  per  iteration.  If  we  assume  that  we  will  require  2n  Golub-Kahan  SVD  steps, 

accumulating  the  Givens  rotations  will  require  a  total  of  approximately  24n  flop 

If  we  add  these  two  items  to  the  workload  for  the  SVD  algorithm  without  accumulating  V ,  we  can 
compute  the  total  workload: 
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(189) 


flop  count  =  1 6mn^  -  -y  +  -y  +  24n^ 

=  \6mn^  +  24/j^ 

Finally,  we  can  determine  the  workload  for  the  SVD  algorithm  including  accumulating  both 
V: 


flop  count  =  16mn^  -  +  ISni^n  -  I6mn^  +  +  24mn^  +  +  24n^ 

=  I6/n^/i  +  24m/i^  +  y/i^ 

=  16/n^n  +  16mn^  +  29«^ 


U  and 

(190) 
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8.  SUMMARY 


The  workload  expressions  for  the  real  and  complex  signal  processing  kernels  are  summarized  below. 


Table  1:  Flop  Count  Summary 


Signal  Processing  Kernel 

Computational  Complexity 

Real  Input 

Complex  Input 

matrix-matrix  multiplication 

2mnp 

Smnp 

fast  Fourier  transform 

^nlogj/i 

5nlog2n 

Householder  QR  decomposition 

forward  or  back  substitution 

2 

n 

4n^ 

eigenvalue  decomposition:  eigenvalues  only 

4  3 

r 

16  3 

T" 

eigenvalue  decomposition:  eigenvalues  and 
eigenvectors 

9n 

23n 

singular  value  decomposition:  singular  values 
only 

A  2  4  3 

4mn  --n 

..  2  16  3 

lomn 

singular  value  decomposition:  singular  values 
and  left  singular  vectors 

Am'  n  +  \2mri' 

2  2 

16m  n  +  24mn 

singular  value  decomposition:  singular  values 
and  right  singular  vectors 

4mn^  +  \  2n^ 

\6mr^  +  24n^ 

singular  value  decomposition:  singular  values, 
left  and  right  singular  vectors 

Atn'n  +  \2mn^  +  13n^ 

16m^n  -h  +  29n^ 

In  the  table  above,  the  parameters  in  the  computational  complexity  expressions  are; 

•  the  dimensions  of  the  two  multiplicands  -  mXn  and  n  x  p  -  for  the  matrix-matrix  multipli¬ 
cation 
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•  the  length  of  the  vector  n  for  the  fast  Fourier  transform 

•  the  size  of  the  triangular  system  n  for  forward  and  back  substitutions 

•  the  dimensions  of  the  input  matrix  mxn  for  the  Householder  QR  decomposition,  eigen¬ 
value  decomposition,  and  singular  value  decomposition. 
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APPENDIX  A.  DIFFERENCES  IN  FLOP  COUNTS 


Some  of  the  flop  counts  given  in  this  document  differ  from  those  found  in  the  literature.  These  differ¬ 
ences  are  discussed  in  greater  detail  here. 

A.1  BIDIAGONALIZATION  IN  THE  SVD 

In  Section  7.1,  we  estimated  flop  count  for  accumulating  Ug  for  the  bidiagonalization  of  a  real 
matrix  to  be 


(191) 


2  4  3 

Golub  and  Van  Loan  ([4])  estimate  this  flop  count  to  be  4m  ”  ~  j'*  •  This  expression  would  be  the 
estimated  flop  count  if  the  summation  given  in  Equation  125  was  instead 


n 


flop  count 


(192) 


n 


=  4m^n  -  4  ^  f" 


n(n+  l)(2n+  1) 
6 


The  different  flop  counts  for  accumulating  U g  for  the  bidiagonalization  of  a  real  matrix  are  summa¬ 
rized  below  in  Table  2. 
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Table  2:  Workload  to  Accumulate  IJg  During  the  Bidiagonalization  of  a  Real  Matrix 


Source 

Flop  Count 

Arakawa 

A  A  2.4  3 

4m  n-Amn  +  -n 

Chan 

«  2  2  3 

2mn  --n 

Golub  and  Van  Loan 

.2  43 

4m  n~-n 

A.2  ACCUMULATION  OF  THE  LEFT  SINGULAR  VECTORS  IN  THE  SVD 

In  Section  7.1,  we  estimated  the  flop  count  for  accumulating  U  during  the  Golub-Kahan  steps  for 
the  SVD  of  a  real  matrix  to  be 

flop  count  =  12mn^  (193) 

2  S3 

Golub  and  Van  Loan  ([4])  estimate  this  flop  count  to  be  4mn  +  -n  . 

2 

Chan  ([!])  estimates  this  workload  to  be  2mn  multiplies.  If  we  assume  that  there  is  one  addition 

2 

operation  for  every  multiplication  operation,  this  workload  would  be  equivalent  to  Amn  flop. 

The  different  flop  counts  for  accumulating  U  during  the  Golub-Kahan  steps  for  the  SVD  of  a  real 
matrix  are  summarized  below  in  Table  3. 

Table  3:  Workload  to  Accumulate  U  During  the  SVD  of  a  Real  Matrix 


Source 

Flop  Count 

Arakawa 

12mn^ 

Chan 

4mn^ 

Golub  and  Van  Loan 
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A.3  ACCUMULATION  OF  THE  RIGHT  SINGULAR  VECTORS  IN  THE  SVD 

In  Section  7. 1 ,  we  estimated  the  flop  count  for  accumulating  V  during  the  Golub-Kahan  steps  for  the 
SVD  of  a  real  matrix  to  be 

flop  count  =  12n^  (194) 

3 

Golub  and  Van  Loan  ([4])  estimate  this  flop  count  to  be  8/j  . 

3 

Chan  ([1])  estimates  this  workload  to  be  2n  multiplies.  If  we  assume  that  there  is  one  addition 

3 

operation  for  every  multiplication  operation,  this  workload  would  be  equivalent  to  4n  flop. 

The  different  flop  counts  for  accumulating  V  during  the  Golub-Kahan  steps  for  the  SVD  of  a  real 
matrix  are  summarized  below  in  Table  4. 

Table  4:  Workload  to  Accumulate  V  During  the  SVD  of  a  Real  Matrix 


Source 

Flop  Count 

Arakawa 

]2n^ 

Chan 

4n^ 

Golub  and  Van  Loan 

00 
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